Add map_deletions_to_ts as Dataset method #429

Open · wants to merge 1 commit into main
Conversation

@hyanwong (Contributor) commented Dec 3, 2024

This seemed like the neatest API for mapping deletions, since we already require a Dataset object and can easily add methods to it, e.g.:

import sc2ts
import tszip

ds = sc2ts.Dataset("../data/viridian_2024-04-29.alpha1.zarr.zip")
ts = tszip.load("../data/find_problematic_v2-2023-01-01.ts.tsz")

start = 11284
end = 11302
del_ts = ds.map_deletions_to_ts(ts, start, end)

I have coded it so that the first sample may or may not be Wuhan. I have also added a test stub, but I'm not sure how to actually test it, as I don't know whether we have a test tree sequence equivalent of fx_dataset with the same named samples.

@jeromekelleher (Owner) commented:

Nice, thanks @hyanwong.

I think we'd probably arrange the API a bit differently: what we'll probably want is to remap all deletions that pass a given frequency threshold, so we'd likely pass in a list of site IDs rather than a single range. I'd also like to add some metadata so we can track these mutations more easily in analysis.
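The site-ID selection described here could be sketched roughly as follows. This is a hypothetical helper, not the sc2ts API: genotype calls per site are shown as plain Python lists, with "-" standing in for the deletion allele, whereas the real implementation would read them from the Dataset / tree sequence.

```python
# Hypothetical sketch: pick site IDs whose deletion frequency exceeds a
# threshold (Jerome suggests something like 10%). In sc2ts the calls would
# come from the Dataset; here they are plain lists, with "-" = deletion.

def sites_with_frequent_deletions(site_alleles, threshold=0.1):
    """Return IDs of sites where the fraction of "-" calls exceeds threshold."""
    selected = []
    for site_id, alleles in site_alleles.items():
        freq = alleles.count("-") / len(alleles)
        if freq > threshold:
            selected.append(site_id)
    return selected

# Toy data: site 11288 has 2/8 deletion calls (25%); site 11300 has none.
example = {
    11288: ["A", "-", "A", "A", "-", "A", "A", "A"],
    11300: ["C", "C", "C", "C", "C", "C", "C", "C"],
}
print(sites_with_frequent_deletions(example))  # -> [11288]
```

The resulting list of site IDs could then be passed to the remapping method in place of the single (start, end) range in the PR description.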

Leave it with me and I'll rejig and write tests when I get a chance.

@hyanwong (Contributor, Author) commented Dec 4, 2024

Great. Re ranges: given the issues I just had with alignments (and after chatting to Isobel), it does seem worth including a short flanking region too; i.e. we shouldn't assume the site positions/IDs are completely accurate w.r.t. deletions.

I was pleasantly surprised that when I remapped the range of 11280 to 11305, counting only "significant" mutations that lead to more than 50 samples, the only regions with deletions were 11283-11296, as discussed in jeromekelleher/sc2ts-paper#249 (comment). That implies that significant deletions might actually be quite rare, and we might be able to pass in all the sites as a first approximation, then narrow down to only those with significant deletions.

@jeromekelleher (Owner) commented:

My guess is that if we include only sites with, say, >10% deletion frequency, we'll get a very good approximation. We track this in the site QC of the ARG:

[Screenshot from 2024-12-04 14-34-19: site QC plot from the ARG]
